Work as part of the Collaborative Research Centre 990: Ecological and Socioeconomic Functions of Tropical Lowland Rainforest Transformation Systems (Sumatra, Indonesia)


1 Introduction

1.1 Previous studies

Many studies have attempted to produce land-use/land-cover maps for Jambi province. Nevertheless, these maps do not meet the project's demands, for various reasons:

  • they do not cover the entire province
  • their classes are delineated visually at coarse resolution
  • they are outdated

e.g. (Ekadinata and Vincent 2011; Melati 2017; Nurwanda, Zain, and Rustiadi 2016; Soo Chin Liew, Chew Wai Chang, and Kim Hwa Lim 2003)


1.2 Objectives of this study

In order to serve as a reference for further research questions in the project, a number of demands have been defined. The produced map should:

  • cover the entire province
  • be up-to-date (ideally 2018/19)
  • have classes similar to those of other classifications, to make it comparable
  • have a focus on oil palm plantations
  • have a reasonable accuracy

Ideally, a similar map with the same classes can be produced for past years to:

  • enable change analysis
  • detect hotspots of forest conversion
  • make predictions for future development
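Such a change analysis boils down to cross-tabulating the per-pixel classes of two maps. A minimal base-R sketch, with invented sample values rather than project results:

```r
# classified maps for two years as per-pixel class vectors (invented sample data)
lulc_2013 <- c("forest", "forest", "forest", "shrub", "oilpalm")
lulc_2019 <- c("forest", "oilpalm", "oilpalm", "shrub", "oilpalm")

# transition (change) matrix: rows = old class, columns = new class
transition <- table(from = lulc_2013, to = lulc_2019)

# pixels leaving the forest class point to conversion hotspots
forest_loss <- sum(transition["forest", colnames(transition) != "forest"])
```

On real data the same cross-tabulation would be computed over all raster pixels of the two classified maps.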

2 Methods

The classification process is done with Google Earth Engine and R. While the reference data was delineated in GEE, images from Bing Maps and Google Earth Pro were also used for visual interpretation of lulc-classes. As model tuning is not possible in GEE, the reference data was exported to R in order to find the best model parameters. The classification itself was then done in GEE, using the model parameters defined in R.

Workflow: Reference Data → Model Tuning → Classification


2.1 Reference Data

2.1.1 LuLc-Classes in previous studies

Previous studies on land-cover and land-use in Jambi used different classification keys. Some classes were classified in sub-categories in other studies, while other classes were merged or were not assessed.
Melati (2017) used different classification schemes for different purposes (areas) by merging the 22 classes differentiated by the Ministry of Forestry. Stolle et al. (2003) also used many classes, while Sambodo and Indriasari (2018), Nurwanda, Zain, and Rustiadi (2016), and Ekadinata and Vincent (2011) merged some of the classes into broader categories.

library(here)
library(tidyverse)
library(kableExtra)

# read the class comparison table
classes <- read_csv(here("raw_data/classification-classes.csv"), na = "NA")
dt_classes <- classes 

names(dt_classes)[1] <- paste0(names(dt_classes)[1], 
                                footnote_marker_symbol(1))
dt_classes %>%
  kable(escape = F) %>%
  kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), font_size = 12, fixed_thead = T) %>%
  column_spec(1:9, width_min = "15em", italic = TRUE) %>%
  collapse_rows(columns = 1:9, valign = "top") %>%
  scroll_box(width = "100%") %>%
  footnote(symbol =  "as used in Melati (2017)" )
Ministry of Forestry* Melati (2017) I Melati (2017) II Melati (2017) III Stolle et al. (2003) Sambodo et al. (2018) Nurwanda et al. (2016) Ekadinata et al. (2011) this study
Primary dryland forest NA NA Primary forest Natural vegetation Forest Forest Forest Primary forest
Primary swamp forest NA NA
Primary mangrove forest NA NA
Secondary dryland forest Secondary dryland forest Secondary forest Secondary forest Logged-over forest Secondary forest
Secondary swamp forest Secondary swamp forest
Secondary mangrove forest Secondary mangrove forest Mangrove + Shrubs with trees
Jungle rubber Jungle rubber Jungle rubber Jungle rubber Small-holder rubber Rubber Rubber Rubber agroforest
Rubber Plantation Rubber plantation Rubber plantation Rubber plantation Rubber plantation Monoculture rubber Plantation forest
Plantation forest Plantation forest ? Plantation forest Tree crop mosaic ? ? ?
Oil palm plantation Oil palm plantation Oil palm plantation Oil palm plantation Oil palm plantation Oil palm + Coconut plantations Oil palm plantation Oil palm plantation Immature oil palm plantation
Mature oil palm plantation
Oil/rubber plantation
NA NA NA NA Coconut plantation NA NA Coconut plantation
Agriculture Mixed dryland agriculture NA Agriculture Tea plantation Cropland Mixed dryland agriculture NA Tea plantation
NA Food crop mosaic NA NA
Dryland agriculture NA Secondary/food mosaic NA NA
Paddy field NA Sawah or tidal rice NA Rice field Rice
Shrub/bush Shrub/bush Shrub/bush Shrub/bush Others Shrub Shrub Shrub Shrub/bush
Swamp bush Swamp bush
Bare land Bare land Bare land Others Bare soil Bare land Cleared land Burned/Cleared
Mining Mining
Water bodies Water body Water body Water Water body NA Water
Fishponds NA
Transmigration areas Settlement Settlement Settlement Built-up area Settlement Urban
Airport
Settlements Homesteads
* as used in Melati (2017)

2.1.2 LuLc-Classes in this study

As Melati (2017) suggests, oil palm plantations were differentiated into mature and immature plantations. The class Rubber was not classified separately, since other studies had no success with this (e.g. Melati (2017)) and because rubber trees could not be differentiated visually in the high resolution imagery used for reference data creation.

  • water
  • primary forest
  • secondary forest
  • oilpalm mature
  • oilpalm immature
  • bush shrub
  • plantation forest
  • burned cleared
  • urban buildings
  • coconut plantation
  • rice
  • tea plantation
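For later processing (e.g. labelling confusion matrices) it is convenient to keep the numeric class codes used in the exported reference data together with their names in a named vector; note that code 9 is unused:

```r
# mapping of class codes (as exported from GEE, code 9 unused) to class names
lulc_classes <- c(
  "0"  = "water",
  "1"  = "primary forest",
  "2"  = "secondary forest",
  "3"  = "oilpalm mature",
  "4"  = "oilpalm immature",
  "5"  = "bush shrub",
  "6"  = "plantation forest",
  "7"  = "burned cleared",
  "8"  = "urban buildings",
  "10" = "coconut plantation",
  "11" = "rice",
  "12" = "tea plantation"
)

unname(lulc_classes["3"])  # "oilpalm mature"
```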

2.1.3 Data collection

Reference data was delineated in GEE with the support of high resolution imagery from Google Earth Pro and Bing Maps. The following conditions were met during data collection:

  • medium to large sized polygons
    (more efficient but also less distributed with higher intracorrelation)
  • cover only one lulc-class
    (avoid mixed pixels and false references)
  • based on high quality images from 2018/19
    (up-to-date mapping)
  • have a low likelihood of recent lulc-changes
    (avoid false references from changed classes)
  • distributed in the entire region
    (cover regional variability)
sample of reference data collected in GEE


see in GEE


2.2 Classification

A supervised classification algorithm with good predictive performance should be combined with informative predictor variables.

input variables

A combination of Sentinel-1 (radar) and Sentinel-2 (optical) imagery is used.
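The optical bands can additionally be combined into spectral indices, and the radar bands into a band ratio, yielding the predictor variables used later. A minimal sketch with invented per-pixel sample values; whether the VV/VH ratio was computed as a dB difference or a linear ratio in GEE is an assumption here:

```r
# invented sample values for one pixel
B4 <- 0.05   # Sentinel-2 red reflectance
B8 <- 0.40   # Sentinel-2 near-infrared reflectance
VV <- -8.5   # Sentinel-1 VV backscatter [dB]
VH <- -14.2  # Sentinel-1 VH backscatter [dB]

# normalized difference vegetation index from the optical bands
NDVI <- (B8 - B4) / (B8 + B4)

# VV/VH band ratio; in dB scale a ratio becomes a difference
VV_VH <- VV - VH

round(NDVI, 2)  # 0.78
```

The other indices (NDMI, NDWI, SAVI, NBRI) follow the same per-pixel band arithmetic with different band combinations.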

classification model

Decision tree models are widely used for classification problems. Preliminary tests indicated that the Random Forest (RF) algorithm performed better than Classification And Regression Trees (CART). Hence, an RF model was trained to classify land use and land cover.
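The advantage of RF over a single tree can be illustrated with a toy base-R simulation: aggregating many independent weak classifiers by majority vote reduces variance. The per-tree accuracy of 0.7 is an invented figure for illustration, not a measured one:

```r
set.seed(42)

# each simulated "tree" votes correctly with probability p_correct
forest_vote <- function(n_trees = 500, p_correct = 0.7) {
  votes <- runif(n_trees) < p_correct
  mean(votes) > 0.5   # majority vote of the ensemble
}

# fraction of simulations in which the ensemble is correct: close to 1,
# far above the 0.7 accuracy of any single simulated tree
mean(replicate(1000, forest_vote()))
```

In a real forest the trees are trained on bootstrap samples and are therefore correlated, so the gain is smaller, but the variance-reduction mechanism is the same.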

classification key


2.2.1 Model Tuning

2.2.1.1 Gridsearch Results

library(readr)
library(tidyverse)
library(caret)
library(kableExtra)
library(here)

# read reference data exported by google earth engine
ref <- read_csv(here("raw_data/reference/stratified-reference_20190521.csv"),
                col_types = cols(.geo = col_skip(), 
                                 class = col_factor(levels = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "10", "11", "12")), 
                                 latitude_209564535 = col_skip(), 
                                 longitude_209564535 = col_skip(), 
                                 `system:index` = col_skip()))

# split in train and test data
index <- createDataPartition(y = ref$class, p = .7, list = FALSE)
training <- ref[index, ]
testing <- ref[-index, ]
# create random forest model
filename = here("output_data/model/rf_fit.rds")
if (file.exists(filename)){
  
  rf_fit <- readRDS(filename)
  
} else{
  
  # specify that the resampling method is 
  fit_control <- trainControl(## 10-fold CV
    method = "cv",
    number = 10)
  
  # define a grid of parameter options to try
  rf_grid <- expand.grid(mtry = c(2, 3, 4, 6, 8),
                         splitrule = c("gini"),
                         min.node.size = c(1, 5, 10))
  
  # run a random forest model
  set.seed(825)
  library(doParallel)
  cl <- makePSOCKcluster(4)
  registerDoParallel(cl)
  
  # fit the model with the parameter grid
  rf_fit <- train(class ~ ., 
                  data = training, 
                  method = "ranger",
                  importance = "impurity" ,
                  trControl = fit_control,
                  # provide a grid of parameters
                  tuneGrid = rf_grid)
  
  stopCluster(cl)
  
  # save model
  saveRDS(rf_fit, filename)
  
}
print(rf_fit)
## Random Forest 
## 
## 8400 samples
##   19 predictor
##   12 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '10', '11', '12' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 7560, 7560, 7560, 7560, 7560, 7560, ... 
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  Accuracy   Kappa    
##   2      1             0.9273810  0.9207792
##   2      5             0.9261905  0.9194805
##   2     10             0.9238095  0.9168831
##   3      1             0.9285714  0.9220779
##   3      5             0.9273810  0.9207792
##   3     10             0.9257143  0.9189610
##   4      1             0.9275000  0.9209091
##   4      5             0.9261905  0.9194805
##   4     10             0.9248810  0.9180519
##   6      1             0.9246429  0.9177922
##   6      5             0.9267857  0.9201299
##   6     10             0.9241667  0.9172727
##   8      1             0.9241667  0.9172727
##   8      5             0.9261905  0.9194805
##   8     10             0.9235714  0.9166234
## 
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 3, splitrule = gini
##  and min.node.size = 1.
# ggplot(data= rf_fit$results) + 
#   geom_line(aes(x=mtry, y=Accuracy, color= as.factor(min.node.size))) + 
#   geom_point(aes(x=mtry, y=Accuracy, color= as.factor(min.node.size)), size = 3) +
#   scale_color_viridis(discrete = TRUE, option = "D", name="Minimal Node Size") +
#   labs(x = "Randomly Selected Predictors",
#        y = "Accuracy (Cross-Validation)") +
#     theme_classic() +
#   theme(legend.position="bottom") 

rf_fit$results %>%
  plot_ly(x= ~mtry) %>%
  add_trace(y = ~Accuracy, 
            mode = 'lines+markers', 
            color= ~ as.factor(min.node.size), colors = viridis_pal(option = "D")(3),
            error_y = ~list(array = AccuracySD),
            text = ~paste("Accuracy: ", Accuracy, "<br>AccuracySD:", AccuracySD) 
            ) %>%
  layout(xaxis = list(title="Randomly Selected Predictors"),
         yaxis = list(title="Accuracy (Cross-Validation)")) 
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter

Gridsearch result for number of predictor variables and minimal node size

Differences are not very pronounced.


2.2.1.2 Model Simplification

Simplest model with high accuracy (at most 2% difference to the best model)

# get simplest model with similar accuracy
whichTwoPct <- tolerance(rf_fit$results, metric = "Accuracy", 
                         tol = 2, maximize = TRUE)  

rf_fit$results[whichTwoPct,]
##   mtry splitrule min.node.size Accuracy     Kappa  AccuracySD     KappaSD
## 1    2      gini             1 0.927381 0.9207792 0.008566283 0.009345036

2.2.1.3 Variable Importance

During model fitting a relative variable importance is calculated to give an impression of which predictor variables are valuable for the prediction process and which are less so. Note, however, that correlation between variables is not taken into account.

# variable importance
imp <- varImp(rf_fit)

plot(imp)
Relative variable importance

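Because correlated predictors share importance between them, a complementary correlation check is useful. A self-contained sketch with toy data; in the project this would be run on the 19 training predictors instead:

```r
# toy predictors: x2 is almost a copy of x1
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)
x3 <- rnorm(100)
preds <- data.frame(x1, x2, x3)

cor_mat <- cor(preds)

# report strongly correlated pairs (|r| > 0.9), upper triangle only
idx <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[idx[, 1]],
           var2 = colnames(cor_mat)[idx[, 2]],
           r    = cor_mat[idx])
```

Pairs flagged here are candidates for removal before interpreting the importance ranking.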


2.2.1.4 Model Validation

The validation data is classified with the best model obtained by gridsearch. The confusion matrix is calculated as a measure of prediction accuracy.


Error Matrix

# validation
testing.pred <- predict(rf_fit, newdata = testing[1:19])
cm <- confusionMatrix(data = testing.pred, reference = testing$class)

cm.df <- as.data.frame.matrix(cm$table)

cm.df %>%
  
  mutate_all(~ifelse(. > max(cm.df[1:12])*0.75,
                  cell_spec(., "html", color = "white", bold = T, background = "green"),
                  ifelse(. > 0,
                         cell_spec(., color = "white", bold = T, background = spec_color(., option = "A", direction= -1, begin = 0.3, scale_from = c(0,53))),
                         cell_spec(., "html", color = "grey")
                         )
                  )
             ) %>%
  
  mutate(class = cell_spec(c("water - 0","primary forest - 1","secondary forest - 2","mature oilpalm - 3", "immature oilpalm - 4","shrubland - 5","plantation forest - 6","burned / cleared - 7","urban buildings - 8","coconut plantation - 10","rice - 11","tea plantation - 12"), "html",bold = T)) %>%
  select(class, everything(.)) %>%
  kable("html", escape = F, rownames = T, align=c('r', rep('c', 12))) %>%
  kable_styling("hover", full_width = F)
class 0 1 2 3 4 5 6 7 8 10 11 12
water - 0 300 0 0 0 0 0 0 0 0 0 0 0
primary forest - 1 0 282 9 1 1 0 3 0 0 4 0 0
secondary forest - 2 0 8 284 0 0 0 2 0 0 3 0 0
mature oilpalm - 3 0 1 1 293 6 0 0 0 0 1 0 0
immature oilpalm - 4 0 0 0 2 293 0 0 0 1 2 0 0
shrubland - 5 0 0 3 2 0 300 0 0 0 1 0 0
plantation forest - 6 0 0 0 0 0 0 294 0 0 0 0 0
burned / cleared - 7 0 0 0 0 0 0 1 298 2 0 0 0
urban buildings - 8 0 0 0 0 0 0 0 0 297 0 0 0
coconut plantation - 10 0 9 3 2 0 0 0 0 0 289 0 0
rice - 11 0 0 0 0 0 0 0 2 0 0 300 0
tea plantation - 12 0 0 0 0 0 0 0 0 0 0 0 300


Respective Accuracy

print(cm$overall[1:4])
##      Accuracy         Kappa AccuracyLower AccuracyUpper 
##     0.9805556     0.9787879     0.9754962     0.9848115

2.2.2 Feature Selection

To further simplify the prediction model, Recursive Feature Elimination (RFE) is applied. At each step this eliminates the worst performing predictor variables (ranked by importance) and keeps the best performing ones.

# perform reverse feature selection with all variables
filename = here("output_data/model/rfProfile_all.rds")
if (file.exists(filename)){
  
  rfProfile <- readRDS(filename)
  
  } else{
    # normalize data
    library(recipes)
    training_recipe <- recipe(class ~ ., data = training) %>%
      step_center(all_predictors()) %>%
      step_scale(all_predictors()) %>%
      step_nzv(all_predictors()) 
    
    train_prepped <- 
      training_recipe %>% 
      prep(training) %>% 
      juice()
    
    # number of features to test
    subsets <- c(1:19)
    
    training_ctrl <- rfeControl(
      method = "repeatedcv",
      repeats = 5,
      functions = rfFuncs, 
      returnResamp = "all"
    )
    
    library(doParallel)
    cl <- makePSOCKcluster(4)
    registerDoParallel(cl)
    
    rfProfile <- rfe(x = train_prepped %>% dplyr::select(-class),
                     y = train_prepped$class,
                     sizes = subsets,
                     metric = "Accuracy",
                     rfeControl = training_ctrl)
    
    stopCluster(cl)
    
    # save model
    saveRDS(rfProfile, filename)
    
    }

2.2.2.1 Number of Features

The best model with regard to predictor variables uses 18 out of 19 variables, all except VV_variance. However, one can see that the model performs equally well with fewer predictor variables.

Model performance by number of features evaluated with Recursive Feature Elimination

Chosen variables

##  [1] "VH"          "B5"          "VV_VH"       "B11"         "VV"         
##  [6] "B12"         "NBRI"        "NDVI"        "B4"          "NDMI"       
## [11] "B3"          "B6"          "SAVI"        "NDWI"        "B7"         
## [16] "VH_variance" "B8A"         "B8"

2.2.2.2 Model Simplification

To simplify the model without losing prediction accuracy, we search for a model with fewer predictor variables that has almost the same accuracy (at most 2% difference).

##   Variables  Accuracy     Kappa AccuracySD     KappaSD
## 8         8 0.9098095 0.9016104 0.00739575 0.008068091

A model using the following 8 predictor variables instead of all 18 has almost the same accuracy.

selectedVars <- rfProfile$variables
var_num <- 8  # number of variables selected above
bestVar <- rfProfile$control$functions$selectVar(selectedVars, var_num)
bestVar
## [1] "VH"    "B5"    "VV_VH" "B11"   "VV"    "B12"   "NBRI"  "NDVI"

2.2.3 Final Model

Using the results from the previous analyses, we train a model with the best performing predictor variables and model hyperparameters.

The predictor variables are:

  • VH
  • B5
  • VV:VH
  • B11
  • VV
  • B12
  • NBRI
  • NDVI

The hyperparameters are:

  • Number of variables to possibly split at in each node (mtry) = 2
  • Minimal node size = 1
  • Number of trees = 500 (this was not optimized, as more trees usually give better results)

2.2.3.1 Model Validation

Applying the final model to the validation data set yields the following error matrix.


Error Matrix

library(ranger)

f <- as.formula(paste("class", paste(bestVar, collapse=" + "), sep=" ~ "))

simpler_model <- ranger(formula = f, 
                        data = training,
                        num.trees = 500, 
                        mtry = 2,
                        min.node.size = 1,
                        importance = "impurity")

testing.pred <- predict(simpler_model, testing)

cm <- confusionMatrix(data = testing.pred$predictions, reference = testing$class)

cm.df <- as.data.frame.matrix(cm$table)

cm.df %>%
  
  mutate_all(~ifelse(. > max(cm.df[1:12])*0.5,
                  cell_spec(., "html", color = "white", bold = T, background = "green"),
                  ifelse(. > 0,
                         cell_spec(., color = "white", bold = T, background = spec_color(., option = "A", direction= -1, begin = 0.3, scale_from = c(0,53))),
                         cell_spec(., "html", color = "grey")
                         )
                  )
             ) %>%
  
 mutate(class = cell_spec(c("water - 0","primary forest - 1","secondary forest - 2","mature oilpalm - 3", "immature oilpalm - 4","shrubland - 5","plantation forest - 6","burned / cleared - 7","urban buildings - 8","coconut plantation - 10","rice - 11","tea plantation - 12"), "html",bold = T)) %>%
  select(class, everything(.)) %>%
  kable("html", escape = F, rownames = T, align=c('r', rep('c', 12))) %>%
  kable_styling("hover", full_width = F) 
class 0 1 2 3 4 5 6 7 8 10 11 12
water - 0 299 0 0 0 0 0 0 0 0 0 0 0
primary forest - 1 0 221 38 2 0 0 4 0 0 13 0 2
secondary forest - 2 0 49 229 0 1 0 3 0 0 24 0 0
mature oilpalm - 3 0 3 4 278 20 1 0 0 0 3 0 1
immature oilpalm - 4 0 1 1 14 273 1 0 1 1 1 1 2
shrubland - 5 0 0 5 2 0 284 1 0 0 2 0 3
plantation forest - 6 0 0 1 0 0 0 290 0 0 1 0 0
burned / cleared - 7 0 0 0 0 0 0 1 291 4 0 4 0
urban buildings - 8 0 0 0 0 0 0 0 0 293 0 1 0
coconut plantation - 10 0 26 22 3 6 2 1 1 0 255 0 0
rice - 11 1 0 0 0 0 4 0 7 2 1 293 0
tea plantation - 12 0 0 0 1 0 8 0 0 0 0 1 292


Respective Accuracy

print(cm$overall[1:4])
##      Accuracy         Kappa AccuracyLower AccuracyUpper 
##     0.9161111     0.9084848     0.9065748     0.9249651

2.2.4 Classification Reliability

The model validation provides us with values for overall model performance, but cannot give us regional performance values. Therefore a second classification was produced to provide per-pixel classification certainty. Since classification probability can only be calculated for a two-class classifier, and since the project focuses on oil palm plantations, the lulc-classes in the training dataset were merged into the following classes:

  1. Oilpalm plantation (mature and immature)
  2. other classes

A classifier was then trained on this data with its output mode set to probability.
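The merging step can be sketched in base R; the short class vector is invented sample data, with codes 3 and 4 being the mature and immature oil palm classes:

```r
# invented sample of reference class codes
class <- factor(c("3", "4", "1", "7", "3", "11"))

# merge into a binary oilpalm / other target
binary <- factor(ifelse(class %in% c("3", "4"), "oilpalm", "other"))
table(binary)

# a probability classifier could then be trained on this target, e.g. with
# ranger(..., probability = TRUE) in R, or with the classifier's output mode
# set to probability in GEE
```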

see in GEE


3 Results

see in GEE

4 Problems

4.1 Classification

4.1.1 Reference data

bad image quality

see in Google Maps

For a lot of areas in Jambi there is no appropriate reference image available. Problems, among others, are:
- no high resolution imagery is available
- high resolution imagery is covered by clouds
- the last appropriate image is several years old

land-use mosaic 1

see in Google Maps

In many regions of Jambi province, the lulc-pattern is very heterogeneous, with many different classes densely mixed. In these cases it is hard to collect reference data, because areas covered by a single lulc-class are very small and the chance of receiving backscatter from mixed classes is high.

land-use mosaic 2

see in Google Maps

In some cases lulc-classes are divided by other classes (e.g. intercropped with trees), which also makes it more difficult to collect proper reference data.

similar appearance

see in Google Maps

Some classes are hard to differentiate from each other in aerial images. For example:
- Primary forest - Secondary forest
- Secondary forest - Plantation forest
- Oil palm plantation - Coconut plantation

4.1.2 Processing

Although most processing steps were done in Google Earth Engine, using high performance server clusters via cloud computing, the calculations were problematic. Vast amounts of data had to be processed: the area of interest (Jambi) is quite large and covers many image tiles, and for each tile almost two years of data from different sources (Sentinel-1 and Sentinel-2) was processed in many steps. Due to this complexity, calculations were aborted (timed out) from time to time, and other calculations took several days to complete before results could be evaluated and the following steps could be initiated.


References

Ekadinata, Andree, and Grégoire Vincent. 2011. “Rubber agroforests in a changing landscape: Analysis of land use/cover trajectories in bungo district, indonesia.” Forests Trees and Livelihoods 20 (1): 3–14. doi:10.1080/14728028.2011.9756694.

Melati, Dian Nuraini. 2017. “The use of remote sensing data to monitor land use systems and forest variables of the tropical rainforest landscape under transformation in Jambi Province, Sumatra, Indonesia.” PhD thesis, Göttingen.

Nurwanda, Atik, Alinda Fitriany Malik Zain, and Ernan Rustiadi. 2016. “Analysis of Land Cover Changes and Landscape Fragmentation in Batanghari Regency, Jambi Province.” Procedia - Social and Behavioral Sciences 227 (August). Elsevier BV: 87–94. doi:10.1016/j.sbspro.2016.06.047.

Sambodo, Katmoko Ari, and Novie Indriasari. 2018. “Land Cover Classification of ALOS PALSAR Data Using Support Vector Machine.” International Journal of Remote Sensing and Earth Sciences (IJReSES) 10 (1). Indonesian National Institute of Aeronautics; Space (LAPAN). doi:10.30536/j.ijreses.2013.v10.a1836.

Soo Chin Liew, Chew Wai Chang, and Kim Hwa Lim. 2003. “Hyperspectral land cover classification of EO-1 Hyperion data by principal component analysis and pixel unmixing.” In, 3111–3. Institute of Electrical; Electronics Engineers (IEEE). doi:10.1109/igarss.2002.1027101.

Stolle, F., K. M. Chomitz, E. F. Lambin, and T. P. Tomich. 2003. “Land use and vegetation fires in Jambi Province, Sumatra, Indonesia.” Forest Ecology and Management 179 (1-3): 277–92. doi:10.1016/S0378-1127(02)00547-9.

 

A work by Jens Wiesehahn